rm(list=ls())
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
I’m going to work through the process of importing, cleaning, and transforming a dataset.
I’ll create a single script that will complete every stage. This means that I can repeat the process, or share it with others.
The script relies on the features of the dplyr
library. I load this first, after cleaning my environment.
rm(list=ls())
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Then, I import my data.
<- "https://www.dropbox.com/scl/fi/a9bad7bh74n9jazge29eg/netball_data_999.csv?rlkey=hattwdksw8ispfi6lymg0ox66&dl=1"
url <- read.csv(url)
data rm(url)
This code snippet contains three separate instructions.
First, I create an object in my environment called [url]. This tells R where my datafile is stored.
Then, I use the read.csv
function to pull the file located at [url] into my environment. Notice that I do this by creating a new object called [data], which is the output of the function read.csv
.
Finally, I remove the object [url] from the environment by calling the rm
function.
Now, I inspect the data that has been pulled from dropbox. I use the head
, str
and summary
functions to get a sense of the data.
head(data) # the first six rows ('observations')
player_id position goals_scored assists turnovers
1 1 Goal Shooter 25 8 3
2 2 Goal Attack 18 12 999
3 3 Wing Attack 10 999 1
4 4 Center 999 22 5
5 5 Wing Defense 5 10 999
6 6 Goal Defense 12 999 2
str(data) # the variable types that R has ASSUMED
'data.frame': 10 obs. of 5 variables:
$ player_id : int 1 2 3 4 5 6 7 8 9 10
$ position : chr "Goal Shooter" "Goal Attack" "Wing Attack" "Center" ...
$ goals_scored: int 25 18 10 999 5 12 999 30 15 7
$ assists : int 8 12 999 22 10 999 6 14 12 5
$ turnovers : int 3 999 1 5 999 2 3 4 999 1
summary(data) # descriptive statistics for each of the variables
player_id position goals_scored assists
Min. : 1.00 Length:10 Min. : 5.00 Min. : 5.0
1st Qu.: 3.25 Class :character 1st Qu.: 10.50 1st Qu.: 8.5
Median : 5.50 Mode :character Median : 16.50 Median : 12.0
Mean : 5.50 Mean :212.00 Mean :208.7
3rd Qu.: 7.75 3rd Qu.: 28.75 3rd Qu.: 20.0
Max. :10.00 Max. :999.00 Max. :999.0
turnovers
Min. : 1.00
1st Qu.: 2.25
Median : 3.50
Mean :301.60
3rd Qu.:750.50
Max. :999.00
Data is often ‘messy’. We’ll cover how to deal with this in more detail in the next few weeks, but for now, assume that we want to represent missing data with the value NA.
In this dataset, ‘999’ represents an missing value.
First, we’ll replace these values with ‘NA’.
Note that R will do this automatically if the element is truly missing.
# Replace 999 with NA in the dataset
== 999] <- NA data[data
The simplest approach to dealing with missing data is to remove any observations with missing data (‘NA’) in any of the columns. We can use the na.omit
function to do this.
# Remove observations with missing data
<- na.omit(data) data_clean
Notice that, following good practice, I haven’t overwritten the original data frame. Rather, I’ve created a new data frame that is a ‘clean’ version of the old one.
Datasets often contain variables that we don’t need. To keep things tidy, I usually delete anything I won’t be using. The advantage of doing this in a script is that, if I change my mind later, I can just edit the script to retain the variable or variables.
# Remove variable [turnovers]
$turnovers <- NULL
data_clean
# I'm going to reintroduce [turnovers]
<- na.omit(data$turnovers)
turnovers_temp
<- inner_join(data, data_clean, by = "player_id")
extracted_players
$position.y <- NULL
extracted_players$goals_scored.y <- NULL
extracted_players$assists.y <- NULL
extracted_players
<- extracted_players
data_clean
rm(extracted_players)
Often, we want to rename variables, especially if the dataset has been provided by an external organisation.
In the following example, I change the name of the variable [position] to [player_position], [assists] to [player_assists] and [turnovers] to [player_turnovers]:
# Rename variables
names(data_clean) <- c("player_id", "player_position", "goals_scored", "player_assists", "player_turnovers")
Last week, we talked about the importance of making sure that R understands the types of variables in our dataset.
In particular, anything that is going to be a grouping variable needs to be identified as a factor.
In our example, [player_position] is a factor, as we can use it to group players into different categories.
# Change variable types
$player_position <- as.factor(data_clean$player_position) data_clean
Then, I can use the str
function to check variable types. Notice that R has identified that there are two levels of the factor [player_position].
# Use the 'str' function to check variable types
str(data_clean)
'data.frame': 3 obs. of 5 variables:
$ player_id : int 1 8 10
$ player_position : Factor w/ 2 levels "Goal Shooter",..: 1 1 2
$ goals_scored : int 25 30 7
$ player_assists : int 8 14 5
$ player_turnovers: int 3 4 1
I’ll now move to creating a new variable based on something done to existing variables.
Calculate a new variable, [data_clean$ratio], which is [goals_scored]/[player_assists].
# Calculate 'ratio' as [goals_scored] divided by [player_assists]
$ratio <- data_clean$goals_scored/data_clean$player_assists data_clean
I might want to round this variable:
$ratio <- round(data_clean$ratio,2) # round the new variable to 2 decimal places
data_cleanhead(data_clean)
player_id player_position goals_scored player_assists player_turnovers ratio
1 1 Goal Shooter 25 8 3 3.12
2 8 Goal Shooter 30 14 4 2.14
3 10 Wing Defense 7 5 1 1.40
Finally, I’ll Write the dataset as a .csv file to an appropriate location.
write.csv(data_clean,"my_netball_data.csv")
Now, create a script that conducts all of the following steps.
Begin by including the following snippet, which directs R to the file location:
<- "https://www.dropbox.com/scl/fi/w6cmx5fgl6y5e1sizeskf/t08_data_b1700_01.csv?rlkey=trab5xa3hpqhf01ruhx1w2ers&dl=1"
url <- read.csv(url)
data rm(url)
Working on the dataframe ‘data’, and with the dplyr
library installed and loaded:
head
, str
and summary
commands to overview the dataset.Use the head
, str
and summary
commands to overview the dataset.
# Overview the dataset
head(data)
str(data)
summary(data)
In this dataset, ‘999’ represents an outlier. Replace these values with ‘NA’.
# Replace 999 with NA in the dataset
== 999] <- NA data[data
Remove any observations with missing data (‘NA’).
# Remove observations with missing data
<- na.omit(data) data_clean
Remove the variable [X] from the dataset.
# Remove variable 'X'
$X <- NULL data_clean
Rename each of the variables to the format [var_id], [var_a] etc.
# Rename variables
names(data_clean) <- c("var_id", "var_a", "var_b", "var_c", "var_d")
Change the variable types as follows: [var_id] = factor, [var_a] = factor.
# Change variable types
$var_id <- as.factor(data_clean$var_id)
data_clean$var_a <- as.factor(data_clean$var_a) data_clean
Use the str
function to check variable types.
# Use the 'str' function to check variable types
str(data_clean)
Calculate a new variable, [var_e], which is the sum of [var_c] + [var_d].
# Calculate 'var_e' as the sum of 'var_c' + 'var_d'
$var_e <- data_clean$var_c + data_clean$var_d data_clean
First, I load the tidyverse
library and clear my environment.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ readr 2.1.5
✔ ggplot2 3.5.1 ✔ stringr 1.5.1
✔ lubridate 1.9.3 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
rm(list=ls())
Then, I create a simple dataframe that contains the statistics of a hypothetical football player:
<- data.frame(
football_df player_name = c("John", "Mike", "Lucas", "Eva"),
goals_scored = c(5, 10, 3, 6),
assists = c(4, 6, 2, 3)
)print(football_df)
player_name goals_scored assists
1 John 5 4
2 Mike 10 6
3 Lucas 3 2
4 Eva 6 3
Convert the dataframe to a tibble and print it.
<- as_tibble(football_df)
football_tibble print(football_tibble)
# A tibble: 4 × 3
player_name goals_scored assists
<chr> <dbl> <dbl>
1 John 5 4
2 Mike 10 6
3 Lucas 3 2
4 Eva 6 3
I can add a new player’s statistics to the dataframe and a new column for [games_played] using the rbind
function.
<- rbind(football_df, data.frame(player_name = "Sophia", goals_scored = 7, assists = 5)) football_df
I can add a new variable for all players (observations) called [games_played]
$games_played <- c(10, 12, 9, 11, 10)
football_dfprint(football_df)
player_name goals_scored assists games_played
1 John 5 4 10
2 Mike 10 6 12
3 Lucas 3 2 9
4 Eva 6 3 11
5 Sophia 7 5 10
Using dplyr
, filter out players who’ve scored more than 5 goals.
<- football_df %>% filter(goals_scored > 5)
top_scorers print(top_scorers)
player_name goals_scored assists games_played
1 Mike 10 6 12
2 Eva 6 3 11
3 Sophia 7 5 10
Calculate a new column [goal_per_game] and get the average goals scored by the team.
<- football_df %>%
football_df mutate(goal_per_game = goals_scored / games_played)
<- football_df %>%
avg_goals summarise(mean_goals = mean(goals_scored))
print(football_df)
player_name goals_scored assists games_played goal_per_game
1 John 5 4 10 0.5000000
2 Mike 10 6 12 0.8333333
3 Lucas 3 2 9 0.3333333
4 Eva 6 3 11 0.5454545
5 Sophia 7 5 10 0.7000000
print(avg_goals)
mean_goals
1 6.2
Arrange players by goals scored in descending order.
<- football_df %>%
sorted_df arrange(desc(goals_scored))
print(sorted_df)
player_name goals_scored assists games_played goal_per_game
1 Mike 10 6 12 0.8333333
2 Sophia 7 5 10 0.7000000
3 Eva 6 3 11 0.5454545
4 John 5 4 10 0.5000000
5 Lucas 3 2 9 0.3333333
Group players by position and get the total goals scored for each position.
$position <- c("Forward", "Midfielder", "Defender", "Forward", "Midfielder")
football_df<- football_df %>%
position_goals group_by(position) %>%
summarise(total_goals = sum(goals_scored))
print(position_goals)
# A tibble: 3 × 2
position total_goals
<chr> <dbl>
1 Defender 3
2 Forward 11
3 Midfielder 17
The following activity allows you to practise some of the techniques covered above.
You may also have to do some research to find out how to complete some of the challenges!
Install and load necessary packages (e.g. tidyverse
)
Download the data from a url, and create a dataframe called [data].
https://www.dropbox.com/scl/fi/ank5v48ndxqggw6mqx6dh/football_data_999_with_position.csv?rlkey=2zqwk0tfxkq3mc6ikvxoyu8r6&dl=1
Convert 999 values (missing data) to NA.
Create a new dataframe called [data_clean] by removing observations with missing data.
Convert the above dataframe to a tibble called [football_tibble] and print it.
Delete all objects apart from [football_tibble].
Add a new player’s statistics to [football_tibble].
Using dplyr
, filter out players who’ve scored fewwer than 16 goals.
Calculate a new variable [goal_per_game] and get the average goals scored by the team.
Arrange players by goals scored in descending order.
Finally, group players by position and get the total goals scored for each position.
rm(list=ls())
library(tidyverse)
<- "https://www.dropbox.com/scl/fi/ank5v48ndxqggw6mqx6dh/football_data_999_with_position.csv?rlkey=2zqwk0tfxkq3mc6ikvxoyu8r6&dl=1"
url <- read.csv(url)
data rm(url)
# Replace 999 with NA in the dataset
== 999] <- NA
data[data # Remove observations with missing data
<- na.omit(data) data_clean
<- as_tibble(data_clean)
football_tibble print(football_tibble)
rm(data, data_clean)
<- rbind(football_tibble, data.frame(player_name = "Sophia", goals_scored = 7, assists = 5, player_position = "Forward"))
football_tibble print(football_tibble)
dplyr
, filter out players who’ve scored fewer than 16 goals.<- football_tibble %>% filter(goals_scored > 15)
top_scorers print(top_scorers)
Calculate a new column [goals_and_assists] which is the sum of [goals_scored] and [assists], and get the average goals scored by the team.
<- football_tibble %>%
football_tibble mutate(goal_per_game = goals_scored + assists)
<- football_tibble %>%
avg_goals summarise(mean_goals = mean(goals_scored))
print(football_tibble)
print(avg_goals)
<- football_tibble %>%
sorted_df arrange(desc(goals_scored))
print(sorted_df)
<- football_tibble %>%
position_goals group_by(player_position) %>%
summarise(total_goals = sum(goals_scored))
print(position_goals)